Networks and Algorithms for Very Large Scale Parallel Computation
by  Allan  Gottlieb  and  J.T.  Schwartz 
Computer  Science  Department,  Courant  Institute, 
New  York  University* 

1.0   INTRODUCTION 

The  continuing  rapid  progress  of  VLSI  technology  is 
beginning  to  make  possible  the  construction  of  very  large  scale 
parallel  computing  assemblages,  i.e.  computing  systems  in  which 
tens  or  even  hundreds  of  thousands  of  arithmetic  devices 
cooperate  for  the  rapid  solution  of  certain  problems.  Even 
though  parallel  computers  of  this  type  have  been  contemplated  for 
many  years  (see  Kuck  [75],  Barnes  [68],  Schwartz  [66])  and  fairly 
large  parallel  machines  such  as  the  ILLIAC  IV  and  the  ICL  DAP 
have  been  made  operational,  recent  technological  progress  has 
lent  new  interest  to  this  area,  which  has  recently  begun  to 
attract  the  attention  of  increasing  numbers  of  university  and 
industrial  researchers.   Various  lines  of  attack  have  emerged. 

One  line  of  work  (see  Kung  [79],  Kung  [80],  Thompson  [80], 
Browning  [80])  focuses  on  the  great  economic  and  speed  advantages 
that  can  be  gained  by  designing  algorithms  that  conform  well  to 
the restrictions imposed by VLSI technology, and in particular
algorithms and parallel system architectures that lay out well in
two  dimensions.  These  studies  aim  at  the  design  of  powerful 
special-purpose  chips  and  of  systems  small  enough  to  reside  on  a 
single  chip. 

A  second,  somewhat  more  conventional  approach,  to  which  the 
work  reported  in  the  present  paper  belongs,  expects  to  make  use 
of high-performance but otherwise standard microprocessor chips,
tightly coupled via a suitable network. Central to this approach
is the expectation that single-chip processors able to execute
instructions at a 20 megacycle rate, and also megabit memory
chips,  will  be  available  in  quantity  by  the  end  of  the  present 
decade.  As  will  be  seen,  the  pragmatic  desirability  and 
possibility  of  using  modified  versions  of  presently  existing 
programming  languages  to  program  large  parallel  machines  is 
another  feature  of  the  work  to  be  reported  in  this  paper. 


Yet  a  third  emphasis,  specifically  on  architectures  derived 
from  very  general,  abstract  "data-flow"  models  of  parallel 
computation,  has  been  pursued  by  other  researchers,  especially  at 
MIT; see Dennis and Misunas [75] and Arvind et al. [78] for
example.  Recent  work  along  these  lines  has  stressed  the  possible 
advantages  of  a  purely  applicative,  side-effect-free  programming 
language  for  the  description  of  parallel  computation  (see  also 
Mago [79]).

These three research efforts lead to machines suited for
different environments. Kung's systolic arrays should be most
useful for well defined fixed tasks such as the kernel of certain
signal processing applications. However, these arrays may be
hard to adapt when the algorithms change or when many different
cases must be considered.




Although dataflow models have been discussed for several
years, no optimal architecture has yet emerged. In section S, we
show how a dataflow language may be executed with maximum
parallelism  on  the  more  conventional  parallel  machines  described 
below.

2.0   CONNECTION NETWORKS

A  crucial  point  in  the  design  of  any  highly  parallel 
computer  consisting  of  multiple,  relatively  standard 
microprocessors  is  the  nature  of  the  network  that  will  be  used  to 
interconnect them. In selecting a network, an essential
criterion is that it should be physically feasible, i.e. conform
to the restrictions on what is buildable imposed by electronic
technology. A simple way of modeling these restrictions
abstractly is to insist that each node of any feasible
interconnection network should be connected to no more than a
fixed number C of other nodes, independent of the total number N
of nodes in the network (N can grow very large). A second
significant criterion is that the network should be
asymptotically optimal, i.e. should be able to perform various
fundamental  computational  operations  on  a  collection  of  N  data 
items  (initially  distributed  one  per  processor)  as  rapidly  as  is 
theoretically  possible. 

To  make  this  latter  condition  more  specific,  a  bit  of 
theoretical  analysis  is  in  order.  If  we  suppose  that  the 
processing  elements  (PEs)  attached  to  (and  perhaps  constituting) 
the  nodes  of  a  communication  network  are  standard  microcomputers, 
then  during  each  cycle  of  operation  at  most  two  inputs  can  be 
combined at any single processing node. Thus after 2, 3, ..., k
cycles of operation we will be able to produce outputs depending
on at most 4, 8, ..., 2**k inputs. It follows that to produce any
single output depending on every one of N data inputs, at least
log N cycles of operation (logarithms to base 2) are required, no
matter in what pattern we connect the processors of a parallel
assemblage. Any connection pattern which meets this limit to
within a constant factor (when calculating a particular function
of N inputs) is asymptotically optimal (for that calculation).
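
The counting argument above can be checked mechanically. The following Python sketch is illustrative only; the function name min_cycles is ours and does not appear in the paper. It returns the smallest k with 2**k >= N, which is the fewest cycles after which one output can depend on all N inputs when each node combines at most two inputs per cycle.

```python
def min_cycles(n):
    """Smallest k with 2**k >= n.

    Under the assumption that each node combines at most two inputs
    per cycle, no output can depend on all n inputs in fewer cycles.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1.
    return (n - 1).bit_length()
```

For example, min_cycles(512) is 9, the number of stages quoted later for a 512-processor machine.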


It  is  significant  that  several  connection  patterns  which  are 
asymptotically optimal for a wide range of important computations
are known. Functions for which optimal parallel computations are
available  include  Fast  Fourier  Transform  (Pease  [68]),  arbitrary 
permutation  of  N  data  items  (Clos  [53],  Benes  [65]),  and  others. 
A  partial  summary  of  results  of  this  kind  obtained  to  date  is 
found  in  Table  1  below  (taken  from  Schwartz  [80]).  To  clarify 
the  nature  of  these  asymptotically  optimal  networks,  we  will 
consider  a  particularly  simple  problem:  that  of  forming  the  sum 
of N given quantities x(0), ..., x(N-1), which we suppose to be
distributed among N processing elements when addition commences.
To avoid technical complications, we also suppose that N is a
power of 2, specifically N = 2**D.

The  easiest  network  on  which  to  perform  parallel  addition  of 
N  quantities  is  a  connection  pattern,  namely  the  D-dimensional 
hypercube network (see Sullivan [77] and also Fig. 1), which does
not quite meet the condition that the number of neighbors of each
of its nodes is independent of N. To describe this network,
suppose that the N processors constituting the network are
numbered 0..N-1, i.e. are given D-bit identifiers bD..b1. Then
the n-th and the m-th processors are directly connected in the
hypercube network if and only if their D-bit identifiers differ
in a single bit, i.e. if and only if n is obtained from m by
complementing bit j for some j between 0 and D-1 (so that
n=m+2**j or n=m-2**j). (See Fig. 1.)

To  perform  addition  of   N   quantities   on   a   network   of   N 
processors  connected  in  this  way,  we  can  proceed  as  follows: 

1.  First,  add  x(i+1)  to  x(i)  for  each  even  i  and  put  the 
result  in  x(i); 

2. On each subsequent cycle k (k = 2, ..., D), for each i that
is a multiple of 2**k, add x(i+2**(k-1)) to x(i) and put
the result in x(i);

3.  After  D  cycles,  the  required  sum  will  be  present  in 
x(0). 
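
The three steps above can be simulated sequentially. The sketch below is a minimal illustration in Python, not the authors' code; the name hypercube_sum is ours. Each pass of the outer loop stands for one parallel cycle, and the additions inside a pass touch disjoint pairs of nodes, so they could all be carried out at once.

```python
def hypercube_sum(values):
    """Sum N = 2**D values in D cycles; returns (sum, number of cycles)."""
    x = list(values)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("the number of values must be a power of 2")
    d = n.bit_length() - 1
    for k in range(1, d + 1):
        step = 1 << (k - 1)
        # Nodes i and i+step differ only in bit k-1, so they are
        # directly connected in the hypercube.
        for i in range(0, n, 1 << k):
            x[i] += x[i + step]
    return x[0], d
```

Calling hypercube_sum(range(8)) performs three cycles and leaves the total in x(0).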


Much  the  same  process  can  be  carried  out  on  at  least  two 
other  networks,  namely  the  perfect  shuffle  network  (Stone  [71]) 
and  the  cube-connected  cycle  network  (Preparata  and  Vuillemin 
[79])  each  of  which  is  considerably  more  economical  in  its  use  of 
connections  than  is  the  full  hypercube.  In  the  shuffle  network 
(see Fig. 2), processor j is connected to processor j+1 and
processor j-1, and also to processor shuf(j) and to processor
ishuf(j), where shuf(j) (resp. ishuf(j)) designates the result of
rotating the D-bit representation of j left (resp. right) one
bit. Note that the shuffle connection between j and shuf(j)
enables us to bring the bits of j one after another into the
least significant position. This means that, for any algorithm
that uses the hypercube connections between j and j+2**k only in
strictly ascending (or strictly descending) order of k, the
shuffle network will perform as well as the full hypercube
network. But in the shuffle network each node has only 4
connections, whereas in the hypercube each node has log N
connections.

The  summing   operation   can   be   performed   in   the   shuffle 
network  as  follows: 

1.  Put  Limit  =  N; 

2.  For  all  even  j  from  0  to  Limit-1,  add  x(j+1)  to  x(j)  and 
store  the  result  in  x(j); 

3.  Move  x(j)  to  x(ishuf(j))  for  all  j; 

4.  If  Limit>1,  Divide  it  by  2,  and  iterate  steps   (2)   thru 
(4). 
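
A sequential simulation of these four steps is sketched below in Python. It is an illustration under one stated assumption: step (2) is read as adding only pairs (j, j+1) that both lie in the range 0..Limit-1, so that the final pass with Limit = 1 performs no addition. The names ishuf and shuffle_sum are ours.

```python
def ishuf(j, d):
    """Rotate the d-bit representation of j right by one bit."""
    if d == 0:
        return j
    return (j >> 1) | ((j & 1) << (d - 1))


def shuffle_sum(values):
    """Sum N = 2**D values using only neighbour and inverse-shuffle moves."""
    x = list(values)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("the number of values must be a power of 2")
    d = n.bit_length() - 1
    limit = n                                   # step 1
    while True:
        # Step 2 (assumed reading): pairs (j, j+1) inside 0..limit-1.
        for j in range(0, limit - 1, 2):
            x[j] += x[j + 1]
        # Step 3: every item moves along its inverse-shuffle link.
        moved = [None] * n
        for j in range(n):
            moved[ishuf(j, d)] = x[j]
        x = moved
        # Step 4: halve the limit and iterate, or stop.
        if limit > 1:
            limit //= 2
        else:
            return x[0]
```

The inverse shuffle sends each even position j to j/2, which is why the partial sums produced in step (2) end up packed into the lower half of the array before the next pass.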


The  cube-connected  cycle  network  is  obtained  by  taking  the 
hypercube  and  expanding  each  of  its  N  nodes  to  a  "cycle"  of  D  = 
log  N  nodes  numbered  0  thru  D-1  (see  Fig.  3).  Thus  the  nodes  of 
this  network  are  in  1-1  correspondence  with  the  set  of  all  pairs 
[i,j], where i ranges from 0 to D-1, and j ranges from 0 to N-1 =
2**D-1. The node [i,j] is connected to [i-1,j] and [i+1,j],
which makes the set of all nodes [i,j] with fixed j a cycle, and
is also connected to [i,j+2**i] (if the i-th bit of j is zero) or
to [i,j-2**i] (if the i-th bit of j is 1), giving cube-like
connections among all the nodes [i,j] with fixed i. Plainly,
each node in this network has exactly three neighbors, and there
are exactly D*N nodes.

In this network, a shuffle-like effect is obtained simply by
moving data from [i,j] to [i+1,j]. Suppose, for example, that an
array x(i,j) of D*N quantities is given and must be summed. For
a fixed value of i, say i=0, we can do this using exactly the
"hypercube" algorithm discussed above, provided that we move
x(i,j) to x(i+1,j) between successive summing steps. But this
process can be applied, essentially in parallel, to form the sums
of all the "cross-section" arrays x(1,j), x(2,j), ..., x(D-1,j)
(j=0..N-1) at the same time. This makes relatively full use of
the connections available in the cube-connected cycle network.
Once these D partial sums have been formed, they can easily be
combined in an additional D = log N steps using the "cycle"
connections between the nodes [i,0] and [i+1,0].


In more detail, the summing operation can be performed in
the cube-connected cycle network as follows:

1. Put Limit = 1;

2. For each [i,j] for which i lies between 0 and Limit-1
and j is a multiple of 2**(i+1), add x(i,j+2**i) to
x(i,j) and put the result in x(i,j);

3. Move x(i,j) to x(i+1,j) (circularly in i);

4. Increment Limit by 1 and repeat steps (2) thru (4) until
they have been performed D times;

5. Put Limit = 1;

6. For each [i,j] for which i lies between Limit and D-1
and j is a multiple of 2**(i+1), add x(i,j+2**i) to x(i,j)
and put the result in x(i,j);

7.  Move  x(i,j)  to  x(i+1,j)  (circularly  in  i); 

8.  Increment  Limit  by  1  and  repeat  steps  (6)  through  (8) 
until  they  have  been  performed  D-1  times; 

9.  Sum  x(i,0)  for  i=0,...D-1,  putting  the  result  (which  is 
the  sum  of  all  the  original  values  x(i,j))  into  x(0,0). 
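
The pipelined procedure can be traced with a short sequential simulation. The Python sketch below is ours and rests on two assumptions that should be read as such: the first phase is repeated D times, and the second phase starts with the lower index equal to 1, since by then the cross-section that began at i=0 is complete and has returned to cycle position 0. The names add_step and ccc_sum are hypothetical.

```python
def add_step(x, lo, hi):
    """At cycle positions lo..hi, add across the dimension-i cube link."""
    n = len(x[0])
    for i in range(lo, hi + 1):
        for j in range(0, n, 1 << (i + 1)):
            x[i][j] += x[i][j + (1 << i)]


def ccc_sum(x):
    """Sum a D-by-2**D array laid out on a cube-connected cycle network."""
    d = len(x)
    if d == 0 or any(len(row) != (1 << d) for row in x):
        raise ValueError("x must have D rows of 2**D entries each")
    x = [list(row) for row in x]
    # Phase 1 (steps 1-4): the pipeline fills; positions 0..limit-1 add.
    for limit in range(1, d + 1):
        add_step(x, 0, limit - 1)
        x = [x[(i - 1) % d] for i in range(d)]   # move x(i,j) to x(i+1,j)
    # Phase 2 (steps 5-8, assumed to start at 1): the pipeline drains.
    for limit in range(1, d):
        add_step(x, limit, d - 1)
        x = [x[(i - 1) % d] for i in range(d)]
    # Step 9: combine the D cross-section totals held at the nodes [i,0].
    for i in range(1, d):
        x[0][0] += x[i][0]
    return x[0][0]
```

In this sketch the final combination is written as a plain loop; on the network it would use the cycle connections between the nodes [i,0] and [i+1,0].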

The  following  table,  taken  from  Schwartz  [80],  lists  various 
other  algorithms  that  can  be  performed  efficiently  on  either  the 
perfect  shuffle  or  cube  connected  cycle  networks.  Detailed 
references  to  the  works  cited  in  Table  1  can  be  found  in  Schwartz 
[80].

3.0   THE DYNAMIC PERMUTATION PROBLEM


For  the  considerations  occupying  the  remainder  of  the 
present  article,  the  significant  entries  in  Table  1  are  items  (1) 
and  (12).  Given  an  arbitrary  permutation  (and  time  to 
pre-analyse  it,  specifically,  to  factor  it  into  a  product  of 
permutations  and  shuffles),  data  can  be  moved  in  the  manner 
defined  by  this  permutation  using  only  O(log  N)  parallel  steps. 
However, to perform the required pre-analysis requires
O((log N)**4) steps using the best method currently known (cf.
item  (12)  of  Table  1).  Hence  this  approach  solves  only  the 
"static"  data  motion  problem,  namely  the  problem  of  moving  data 
efficiently  in  a  pattern  known  in  advance,  not  the  "dynamic" 
problem  of  moving  data  in  the  unpredictable  patterns  generated  by 
memory requests in a large parallel array of processors. However,
an effective "average case" solution of this dynamic problem can
be given, and this solution justifies consideration of a more
user-friendly model of highly parallel computation, namely the
"paracomputer" model, in which an indefinitely large number of
processors  all  communicate  freely  with  a  single  large  memory. 


The  FMP,  a  512  processor  machine  recently  described  by 
Burroughs  Corp.  in  response  to  NASA's  "Numeric  Aerodynamic 
Simulation  Facility"  requirement,  includes  the  following 
interesting  engineering/algorithmic  idea  to  realize  dynamically 
generated data mappings rapidly without pre-analysis. Let a
function n-->p(n) be given and assume that a vector V is
distributed across the ultracomputer with V(n) located in PEn.
To move V so that V(n) comes to be located in PEp(n), apply a
sequence of shuffles (moving V(n) to shuf(n), shuf(shuf(n)),
etc.) and whenever the low order bits of shuf**k(n) and
shuf**k(p(n)) differ, apply an exchange to correct this bit.
After 9 = log 512 steps the elements of V will have been
redistributed as desired.


Since in the proposed hardware no two items are allowed to
move into the same position, contention will sometimes develop
between pairs of processors. Thus the rule for exchanging data
items must actually be as follows (note that the low order bit of
shuf**k(n) is bit k of n):

1. Take a pair of integers n, n' differing only in their
k-th bit; call n (resp. n') k-wrong if the k-th bit of
n differs from the k-th bit of p(n) (resp. if the k-th bit
of n' differs from the k-th bit of p(n')).

2. Cycle in parallel through the bits of each n, in the
usual way using shuffles. If (on the k-th cycle) both n
and n' are k-wrong, then interchange V(n) and V(n'). If
neither n nor n' is k-wrong, do not interchange. If just
one of the two items n, n' is k-wrong, do not
interchange, but change the k-wrong item to nil,
indicating that this value will not reach its proper
destination and will have to be retransmitted later.

3. If both items of a pair are nil, then do not
interchange. If only one item is nil, then interchange
if the non-nil item is k-wrong, otherwise do not
interchange.
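
The exchange rule can be exercised with a small simulation. The Python sketch below is ours and simplifies the hardware in one respect that is an assumption: instead of shuffling data so that bit k reaches the low-order position, it pairs positions that differ in bit k directly. The names route and delivered_fraction are hypothetical.

```python
import random


def route(perm):
    """Apply rules 1-3 for k = 0..D-1; return the final slot contents.

    slots[a] is the destination of the item now at position a, or
    None if that slot is empty because its item was nilled or moved.
    """
    n = len(perm)
    d = n.bit_length() - 1
    slots = list(perm)
    for k in range(d):
        bit = 1 << k
        for a in range(n):
            if a & bit:
                continue                    # handle each pair once
            b = a | bit
            # k-wrong: bit k of the destination differs from bit k of
            # the current position (0 at a, 1 at b).
            a_wrong = slots[a] is not None and bool(slots[a] & bit)
            b_wrong = slots[b] is not None and not (slots[b] & bit)
            if slots[a] is not None and slots[b] is not None:
                if a_wrong and b_wrong:
                    slots[a], slots[b] = slots[b], slots[a]
                elif a_wrong:
                    slots[a] = None         # rule 2: nil the k-wrong item
                elif b_wrong:
                    slots[b] = None
            elif a_wrong or b_wrong:
                slots[a], slots[b] = slots[b], slots[a]   # rule 3
    return slots


def delivered_fraction(n, seed=0):
    """Fraction of items delivered for one pseudo-random permutation."""
    perm = random.Random(seed).sample(range(n), n)
    slots = route(perm)
    return sum(1 for a, dest in enumerate(slots) if dest == a) / n
```

Every item that survives all D stages sits at its destination, because stage k fixes bit k of its position and later stages never disturb that bit.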

Suppose that the permutation p is selected at random, and
that after the k-th cycle of the above iteration a fraction x(k)
of the V(n) remain non-nil. This means, in particular, that for
an N = 2**D processor system, x(D) is the fraction of successful
requests, i.e. requests that are successfully transmitted through
the network to their desired destination. An item I will be
nilled only if it belongs to a pair of items neither of which is
nil, and only if I is k-wrong and its partner is not. The
probability of this event is

(1/2)(1/2)x(k)**2 = (1/4)x(k)**2

so that x(k) satisfies the recurrence

x(k+1) = x(k)-(1/4)x(k)**2.


The sequence generated by this recurrence is

 k   x(k)      k   x(k)      k   x(k)
 0  1.000      5   .399     10   .259
 1   .750      6   .359     11   .242
 2   .609      7   .327     12   .227
 3   .517      8   .300     13   .214
 4   .450      9   .278     14   .203

Thus in the 512-processor FMP system 28% of all requests
will be transmitted successfully through all 9 shuffle steps; in
a 16K-processor system, 20% of all requests will be transmitted
successfully. For the latter system, if each processor randomly
generates one memory reference per microsecond, a 3.2 gigacycle
memory bandwidth is achieved.
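
The tabulated values follow directly from iterating the recurrence from x(0) = 1. The Python sketch below is ours (the names survival and bandwidth are hypothetical) and also recomputes the bandwidth estimate, taking 16K to mean 16384 processors, which is an assumption.

```python
def survival(stages):
    """Return [x(0), ..., x(stages)] for x(k+1) = x(k) - (1/4) x(k)**2."""
    x = [1.0]
    for _ in range(stages):
        x.append(x[-1] - 0.25 * x[-1] ** 2)
    return x


def bandwidth(processors, stages, requests_per_second=1.0e6):
    """Successful memory references per second for random requests."""
    return processors * survival(stages)[stages] * requests_per_second
```

With these definitions survival(14) reproduces the table to three decimal places, and bandwidth(16384, 14) comes out at roughly 3.3e9 references per second, in line with the rounded figure quoted above.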

The Burroughs architecture employs 9 stages of 256 simple
two-input two-output switches to perform exchanges and effects
the (inverse) shuffle maps by wiring between stages. This leads
to the familiar interconnection network of Lawrie [75],
illustrated (for 16 processors) in Fig. 4.


In  such  a  network  P  =  2»*D  PEs  communicate.  To  describe  the 
network  in  detail,  we  shall  suppose  that  the  PEs  are  numbered 
using  D-bit  identifiers  whose  values  range  from  0  to  P-1,  and 
that  the  binary  representation  of  an  identifier  x  is  written 
xD...x1.  Then  the  network  consists  of  D  stages  of  2  X  2  switches 
with  adjacent  stages  connected  via  the  inverse  of  the  shuffle 
map.  The  PEs  are  directly  connected  to  the  first  stage  of 
switches  and  are  connected  via  the  inverse  shuffle  to  the  last 
stage of switches. Each of the P/2 2 X 2 switches constituting a
single network stage can separately transmit data in one of two
modes, "straight", in which its left-hand top and bottom
terminals are connected to the right-hand top and bottom
terminals respectively, and "crossed", in which its left-hand top
and bottom terminals are connected to the right-hand bottom and
top terminals respectively (see Fig. 5). The crossed (resp.
straight) state is used when one wishes (resp. does not wish) to
interchange items.

The bandwidth of the connection network, which the data
given above show to be reasonably large, can be increased by
duplexing (or quadruplexing) and putting 2 (or 4) copies of each
data item on the network, at 2 (or 4) different priority levels.
If this is done, then the percentage of requests handled
successfully rises as follows (provided that the pattern of
request addresses remains random):

512 PEs, duplexed switch: 48% successful requests

512 PEs, quadruplexed switch: 73% successful requests

16K PEs, duplexed switch: 36% successful requests

16K PEs, quadruplexed switch: 60% successful requests.
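
These four percentages are consistent with a simple model that the text does not state explicitly and that we offer only as an assumption: each copy of a request survives independently with the single-network probability x(D), and the request succeeds if at least one copy gets through. The Python sketch below (function names ours) evaluates that model.

```python
def survival(stages):
    """x(stages) for the recurrence x(k+1) = x(k) - (1/4) x(k)**2."""
    x = 1.0
    for _ in range(stages):
        x -= 0.25 * x * x
    return x


def multiplexed_success(stages, copies):
    """Chance that at least one of `copies` independent copies survives.

    Independence between copies is an assumption of this sketch.
    """
    return 1.0 - (1.0 - survival(stages)) ** copies
```

Under this assumption, 9 stages with 2 and 4 copies, and 14 stages with 2 and 4 copies, round to the four percentages listed above.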


Somewhat  better  results  can  be  obtained  by  bringing  together 
two  pairs  of  data  items,  rather  than  simply  two  items,  at  each 
switch,  and  nilling  a  data  item  only  when  more  than  two  items  at 
a  given  switch  wish  to  proceed  to  the  same  successor  switch. 
Again  consider  a  random  mapping,  and  let  x(k)  be  the  fraction  of 
items which are non-nil when the k-th bit of all mapping
addresses is being examined. Then the expected number of items
that  will  become  nil  per  group  of  4  is 

1 * probability that 3 non-nil items appear in the group
  * probability that all 3 have the same destination

+ 2 * probability that 4 non-nil items appear in the group
  * probability that all 4 have the same destination

+ 1 * probability that 4 non-nil items appear in the group
  * probability that exactly 3 of the 4 have the same destination
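
The expectation above can be evaluated by brute-force enumeration if one assumes, purely for illustration, that the four slots of a group are filled independently with probability x and that each item is equally likely to want either successor switch. The Python sketch below is ours (the names expected_nilled and naive_step are hypothetical), and the independence assumption is a simplification of the actual network behaviour.

```python
from itertools import product


def expected_nilled(x):
    """Expected number of items nilled in one group of 4 slots."""
    total = 0.0
    # Each slot is empty (None) or holds an item bound for successor 0 or 1.
    for slots in product((None, 0, 1), repeat=4):
        prob = 1.0
        for s in slots:
            prob *= (1.0 - x) if s is None else x / 2.0
        # A successor accepts at most two items; the excess is nilled.
        lost = sum(max(0, slots.count(out) - 2) for out in (0, 1))
        total += prob * lost
    return total


def naive_step(x):
    """One stage of survival under the independence assumption."""
    return x - expected_nilled(x) / 4.0
```

Starting from x = 1 this gives 0.8125 after one stage; beyond the first stage the independence assumption is only an approximation.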


Kruskal and Snir [81] have shown that, since items are not
nilled independently, one must use a double recurrence to analyze
the pattern of survival in this network, leading to the following
survival figures:


 k   x(k)      k   x(k)      k   x(k)
 0  1.000      5   .525     10   .393
 1   .812      6   .489     11   .376
 2   .702      7   .459     12   .361
 3   .626      8   .434     13   .347
 4   .570      9   .412     14   .335


Note that this network requires (N/4)log N 4 X 4 switches,
half  the  number  required  by  the  simple  networks  presented  above. 
Again  the  network  can  be  duplexed  or  quadruplexed  for  improved 
performance.

An  alternate  network,  composed  of  only  (N/4)  (1/2  log  N) 
4X4  switches  is  proposed  in  Lawrie  [75].  The  key  idea  behind 
this  network  can  be  seen  by  numbering  the  switches  in  the  first 
column  starting  with  zero  and  noting  that  the  four  outputs  of  any 
even-odd  pair  of  switches  constitute  the  four  inputs  for  two 
switches in the second column. Thus these four switches can be
combined into one 4 by 4 switch, thereby halving the number of
rows  and  the  number  of  columns  (assuming  D  is  even),  and  also 
halving  the  number  of  lines. 

An analysis by Kruskal and Snir [81] shows that in this
network items are nilled independently, so that the following
simple recurrence for the fraction of items surviving k stages of
transmission results:

x(k+2) = 1-(1-x(k)/4)**4

This recurrence leads to the following survival figures:

 k   x(k)      k   x(k)      k   x(k)
 0  1.000      6   .432     12   .283
 2   .684      8   .367     14   .255
 4   .527     10   .319
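
The figures in this table can be regenerated by iterating the recurrence from x(0) = 1. The Python sketch below is a minimal illustration; the name survival_4x4 is ours.

```python
def survival_4x4(stages):
    """Return {0: x(0), 2: x(2), ...} for x(k+2) = 1 - (1 - x(k)/4)**4."""
    x = {0: 1.0}
    for k in range(0, stages, 2):
        x[k + 2] = 1.0 - (1.0 - x[k] / 4.0) ** 4
    return x
```

Each application of the recurrence advances k by 2 because one 4 X 4 switch examines two address bits at a time.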


Once  again  the  network  can  be  duplexed  or   quadruplexed   for 
improved performance.

4.0   PARACOMPUTERS


If the PE's forming the rightmost column of Fig. 4 are
replaced by memory modules (MMs) one obtains a configuration in
which multiple PE's access a large shared memory via the
connection network. The numerical results presented in the
preceding section show that if the PE's request random MMs, the
majority of these requests can get through in a single pass
through the network. We can therefore regard such a
configuration as approximating the more ideal kind of parallel
processor, dubbed a "paracomputer" by Schwartz [80], that
consists  of  identical  PEs  sharing  a  common  memory.  The 
individual  PEs  of  such  a  paracomputer  may  also  have  attached 
local  memory,  which  we  refer  to  as  their  "private"  memories;  the 
memory  shared  by  and  common  to  all  processors  is  called  "public", 
and  variables  stored  there  are  called  "public  variables".  In  an 
idealized  paracomputer  the  PEs  can  simultaneously  read  any  public 
cell  in  one  cycle.  Moreover,  simultaneous  writes  (including  the 
replace-add  operation  described  below)  are  likewise  supposed  to 
be  effected  in  a  single  nominal  memory-access  cycle,  and  after 
all  these  writes  any  memory  cell  to  which  such  writes  are 
directed  will  contain  some  one  of  the  quantities  written  into  it. 
This  requirement  on  simultaneous  memory  updates  illustrates  the 
(paracomputer)  serialization  principle:  The  effect  of 
simultaneous  actions  by  the  PEs  is  as  if  the  actions  occurred  in 
some (unspecified) serial order. (Note that this serialization
principle speaks only of the effect of simultaneous writes and
not of their implementation.) We stress again that paracomputers
must  be  regarded  as  idealized  computational  models  since  physical 
fan-in  limitations  prevent  their  perfect  realization. 


The  remainder  of  this  report  focuses  on  the  (realizable) 
approximation  to  a  paracomputer  suggested  by  these  last 
considerations.

5.0   SOFTWARE AND SYNCHRONIZATION ISSUES:

THE REPLACE-ADD INSTRUCTION

We  now  introduce  a  simple  yet  very  effective  interprocessor 
synchronization  operation,  called  replace-add,  which  forms  the 
basis  for  much  of  the  following  discussion  of  software  and 
synchronization issues. The format of this operation is simply
RepAdd(V,e),  where  V  is  an  integer  variable  and  e  is  an  integer 
expression.  We  assume  that  this  indivisible  operation  is  such  as 
to  yield  the  sum  S=V+e  as  its  value  and  to  replace  the  contents 
of  storage  location  V  by  this  sum.  Moreover,  RepAdd  must  satisfy 
the  serialization  principle  stated  above:  Assume  that  V  is  a 
public  variable  (as  it  ordinarily  will  be)  and  that  many  (perhaps 
very  many)  replace-add  operations  simultaneously  address  V.  Then 
the  effect  of  these  operations  is  exactly  what  it  would  be  if 
they  occurred  in  some  (unspecified)  serial  order,  i.e.  V  receives 
the  appropriate  total  increment  and  each  operation  yields  the 
intermediate  value  of  V  corresponding  to  its  position  in  this 
order*.  The  following  example  illustrates  the  semantics  of 
replace-add: If V is a public variable, if PEi executes

ANSi <-- RepAdd(V,ei),

if PEj simultaneously executes

ANSj <-- RepAdd(V,ej),

and if V is not simultaneously updated by yet another PEk, then
either

ANSi <-- V+ei
ANSj <-- V+ei+ej

or

ANSi <-- V+ei+ej
ANSj <-- V+ej

and, in either case, the value of V becomes V+ei+ej. The first
possibility corresponds to the serialized order in which first
PEi executes its replace-add and then PEj executes its
replace-add afterward; the second possibility corresponds to the
opposite serialization.

* These intermediate values result from executing prefixes of the
serialized list of operations.
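
The semantics just described can be imitated in software. The Python sketch below is ours and is not the mechanism proposed in this paper: it obtains indivisibility with a lock, which serializes callers, whereas the hardware described later aims to avoid exactly that bottleneck. The names PublicVariable, rep_add and run_concurrently are hypothetical.

```python
import threading


class PublicVariable:
    """A shared integer supporting an indivisible replace-add."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def rep_add(self, e):
        """Add e to the variable and return the new sum, indivisibly."""
        with self._lock:
            self._value += e
            return self._value

    def load(self):
        with self._lock:
            return self._value


def run_concurrently(var, increments):
    """Issue one rep_add per increment from separate threads."""
    results = [None] * len(increments)

    def worker(i, e):
        results[i] = var.rep_add(e)

    threads = [threading.Thread(target=worker, args=(i, e))
               for i, e in enumerate(increments)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With V = 10, ei = 3 and ej = 5, the two threads receive either 13 and 18 or 18 and 15, and V ends at 18 in both cases, matching the two serializations listed above.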

It  is  also  possible  to  have  loads,  stores,  and  replace-adds 
all  concurrently  directed  at  the  same  memory  location.  Once 
again  the  serialization  principle  demands  that  the  effect  is  as 
though  these  operations  occurred  in  some  serial  order.  In 
particular,  simultaneous  loads  from  the  same  memory  location  may 
not  yield  identical  results  (since  a  simultaneous  store  or 
replace-add  may  intervene). 

The  next  section  presents  a  hardware  design  that  realizes 
the  replace-add  operation  in  essentially  the  same  execution  time 
as  a  load  or  store  to  public  memory  and  that  realizes 
simultaneous  replace-adds  updating  the  same  variable  in  a 
particularly efficient manner. If the replace-add operation is
available, we can perform many important algorithms in a
completely parallel manner, i.e. without using any critical (and
hence necessarily serial) code sections. For example Gottlieb et
al. [80] present a completely parallel solution to the
readers-writers problem*. Section 7 of the present report
reviews  a  highly  concurrent  queue  management  technique  that  can 
be  used  to  implement  a  totally  decentralized  operating  system 
scheduler.  We  are  unaware  of  any  completely  parallel 
test-and-set  solutions  to  these  problems.  Kruskal  [81]  gives 
efficient  replace-add  based  implementations  of  several  other 
important  algorithms. 


* Since writers are inherently serial, the solution cannot
strictly speaking be considered completely parallel. However,
the only critical section used is required by the problem
specification. In particular, during periods when no writers are
active, no serial code is executed.

6.0   HARDWARE IMPLEMENTATION


In this section we show how the replace-add operation can be
implemented using a suitably enhanced Lawrie-type network.

The way in which such a network is used to implement memory
loads and stores has been described above. Basically, the
replace-add operation can be realized in a topologically
identical network by augmenting the MMs and the network nodes
with adders. When a RepAdd(X,e) operation is transmitted through
the network to the MM containing X, the value of X and the
transmitted e are brought to the MM adder, and the sum is both
stored in X and returned through the network to the requesting
PE.  However,  things  are  not  quite  this  simple.  Since  we  expect 
that  concurrent  replace-add  operations  will  frequently  reference 
the  same  memory  location,  efficient  performance  in  this  case  is 
very  important.  Fortunately,  by  including  a  few  cells  of  memory 
(and  an  adder,  as  noted  above)  in  each  switch,  the  network  can 
handle  replace-adds  with  the  same  efficiency  as  it  handles  loads 
and  stores.  (Note  that,  although  we  will  continue  to  use  the 
term  "switch"  for  the  devices  located  at  the  nodes  of  the 
enhanced  network,  the  presence  of  adders,  comparators,  etc.  in 
these devices brings them functionally closer to microprocessors
than to simple switches.)

When two replace-adds referencing the same public variable,
say RepAdd(X,e) and RepAdd(X,f), conflict at a switch, we effect
the serialization order "RepAdd(X,e) immediately followed by
RepAdd(X,f)". This is done as follows: The switch forms the sum
e+f, transmits the combined request RepAdd(X,e+f), and stores the
value f in its local memory (see Fig. 6). When the value Y is
returned to the switch (in response to RepAdd(X,e+f)), Y is
returned along one path (to satisfy the original request
RepAdd(X,f)) and Y-f is returned along another path (to satisfy
the original request RepAdd(X,e)). Assuming that there was no
other conflict, we would have Y = X+e+f; thus the values
returned along these two paths are X+e and X+e+f, and the memory
location X is properly incremented, becoming X+e+f. If other
RepAdd(X,g)  operations  are  simultaneously  processed,  the  combined 
requests  are  themselves  combined,  and  the  associativity  of 
addition  guarantees  that  the  procedure  gives  a  result  consistent 
with  the  serialization  principle. 
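
The combining and decombining steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (combine, decombine, rep_add, demo), the dictionary standing in for a memory module, and the two-level arrangement of three switches are hypothetical and are not taken from the hardware design.

```python
def combine(e, f):
    """Switch-side merge of RepAdd(X,e) and RepAdd(X,f).

    Returns the increment to forward (e + f) and the value the switch
    keeps in its local memory (f).
    """
    return e + f, f


def decombine(y, saved_f):
    """Split the reply y to a combined request into the two original replies.

    The first result answers RepAdd(X,e), the second answers RepAdd(X,f).
    """
    return y - saved_f, y


def rep_add(memory, x, inc):
    """Memory-module adder: store the sum in x and return it."""
    memory[x] += inc
    return memory[x]


def demo():
    """Four simultaneous replace-adds on X, combined by three switches."""
    memory = {"X": 10}
    e, f, g, h = 1, 2, 3, 4

    # First stage: two switches each combine one pair of requests.
    ef, saved_a = combine(e, f)
    gh, saved_b = combine(g, h)
    # Second stage: one switch combines the two combined requests.
    total, saved_c = combine(ef, gh)

    # A single request reaches the memory module.
    y = rep_add(memory, "X", total)

    # Replies are split on the way back, in reverse order.
    reply_ef, reply_gh = decombine(y, saved_c)
    reply_e, reply_f = decombine(reply_ef, saved_a)
    reply_g, reply_h = decombine(reply_gh, saved_b)
    return memory["X"], (reply_e, reply_f, reply_g, reply_h)
```

The four replies produced by demo() are the values that would be seen if the requests had been executed one after another in the order e, f, g, h, which is the consistency with the serialization principle claimed above.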


In  summary,  favorable  replace-add  conflicts  are  processed  as 
follows:

1. RepAdd-RepAdd. As described above, a combined request
is transmitted and the result used to satisfy both
replace-adds.

2. RepAdd-Load. Treat Load(X) as RepAdd(X,0).

3. RepAdd(X,e)-Store(X,f). Transmit the store and satisfy
the replace-add by returning e+f.

As seen in our discussion of loads and stores, this scheme,
which  reduces  communications  traffic,  also  exhibits  good  average 
case  performance. 


The use of 4 X 4 switches to improve performance of the
basic communication network is also applicable for the enhanced
network just described. A detailed analysis and hardware design
of the communication schemes described above will appear in
Rudolph [81].


7.0   MANAGEMENT  OF  HIGHLY  PARALLEL  QUEUES

Since queues are a central data structure for many
algorithms, a concurrent queue access method can be an important
tool for constructing parallel programs. In analyzing one of
their parallel shortest path algorithms, Deo et al. [80]
dramatize the need for this tool:

"However, regardless of the number of processors
used, we expect that algorithm PPDM has a
constant upper bound on its speedup, because
every processor demands private use of the Q."

Refuting  this  pessimistic  conclusion,  we  show  in  this 
section  that,  although  at  first  glance  the  important  problem  of 
queue  management  may  appear  to  require  use  of  at  least  a  few 
inherently  serial  operations,  a  queue  can  be  shared  among 
processors  without  using  any  code  that  would  create  serial 
bottlenecks.  The  procedures  to  be  shown  maintain  the  basic 
first-in  first-out  property  of  a  queue,  whose  proper  formulation 
in  the  assumed  environment  of  large  numbers  of  simultaneous 
insertions  and  deletions  is  as  follows:  If  insertion  of  a  data 
item  p  is  completed  before  insertion  of  another  data  item  q  is 
started,  then  it  must  not  be  possible  for  a  deletion  yielding  q 
to  complete  before  a  deletion  yielding  p  has  started. 


In  the  algorithm  below  we  represent  a  queue  of  length  Size 
by  a  public  circular  array  Q[0:Size-1]  with  public  variables  I 
and  D  pointing  to  the  locations  of  the  items  last  inserted  and 
deleted (these correspond to the rear and front of the queue
respectively). Thus MOD(I+1,Size) and MOD(D+1,Size) yield the
locations for the next insertion and deletion, respectively.
Initially  I=D=0  (corresponding  to  an  empty  queue). 

We maintain two additional counters, #Ql and #Qu, which hold
lower and upper bounds respectively for the number of items in
the queue, and which never differ by more than the number of
active insertions and deletions. Initially #Ql = #Qu = 0, indicating
no activity and an empty queue. The parameters QueueOverflow and
QueueUnderflow appearing in the code shown below are flags
denoting the exceptional conditions that occur when a processor
attempts to insert into a full queue or delete from an empty
queue. (Since a queue is considered full when #Qu >= Size and
since deletions do not decrement #Qu until after they have
removed their data, a full queue may actually have cells that
could be used by another insertion.) The actions appropriate for
the QueueOverflow and QueueUnderflow conditions are application
dependent:  One  possibility  is  simply  to  retry  an  offending 
insert  or  delete;  another  possibility  is  to  proceed  to  some 
other  task. 


Code  for  a  critical-section-free  implementation  of  Insert 
and  Delete  is  given  below.  The  insert  operation  proceeds  as 
follows: First a test-increment-retest (TIR) sequence is used to
guarantee the existence of space for the insertion, and to
increment the upper bound #Qu. If the TIR fails, a QueueOverflow
occurs. If it succeeds, the expression Mod(RepAdd(I,1),Size)
gives the appropriate location for the insertion, and the insert
procedure waits its turn to overwrite this cell (this point is
discussed at the end of this section). Finally, the lower bound
#Ql is incremented. The delete operation is performed in a
symmetrical  fashion;  the  deletion  of  data  can  be  viewed  as  the 
insertion  of  vacant  space. 


Procedure Insert(Data,Q,QueueOverflow)
    If TIR(#Qu,1,Size) Then {
        MyI <-- Mod(RepAdd(I,1),Size)
        Wait turn at MyI
        Q[MyI] <-- Data
        RepAdd(#Ql,1)
        QueueOverflow <-- False }
    Else QueueOverflow <-- True
End Procedure

Procedure Delete(Data,Q,QueueUnderflow)
    If TDR(#Ql,1) Then {
        MyD <-- Mod(RepAdd(D,1),Size)
        Wait turn at MyD
        Data <-- Q[MyD]
        RepAdd(#Qu,-1)
        QueueUnderflow <-- False }
    Else QueueUnderflow <-- True
End Procedure


Boolean Procedure TIR(S,Delta,Bound)
    If S+Delta <= Bound Then
        If RepAdd(S,Delta) <= Bound Then TIR <-- true
        Else { RepAdd(S,-Delta)
               TIR <-- false }
End Procedure


Boolean Procedure TDR(S,Delta)
    If S-Delta >= 0 Then
        If RepAdd(S,-Delta) >= 0 Then TDR <-- true
        Else { RepAdd(S,Delta)
               TDR <-- false }
End Procedure


Although the initial test in both TIR and TDR may appear to
be redundant, a closer inspection shows that their removal
permits unacceptable race conditions. This point is discussed in
detail in Gottlieb et al. [80] where other replace-add based
software primitives are presented as well.
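
The Insert, Delete, TIR and TDR procedures can be tried out with ordinary threads. The following Python sketch is an emulation under several assumptions, not the paracomputer implementation: the class names Shared and ParallelQueue are hypothetical; the lock inside Shared only imitates the indivisibility of the replace-add, which on the paracomputer needs no lock; the "wait turn" step is realized here by an assumed per-cell turn counter (even means empty, odd means full) guarded by a condition variable; and the overflow and underflow flags are returned as values rather than passed as parameters.

```python
import threading


class Shared:
    """A public variable whose rep_add imitates the atomic replace-add."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # emulation only, see lead-in

    def rep_add(self, delta):
        with self._lock:
            self.value += delta
            return self.value  # replace-add returns the new value


def tir(s, delta, bound):
    """Test-increment-retest."""
    if s.value + delta <= bound:          # initial test
        if s.rep_add(delta) <= bound:     # increment and retest
            return True
        s.rep_add(-delta)                 # undo the increment
    return False


def tdr(s, delta):
    """Test-decrement-retest."""
    if s.value - delta >= 0:
        if s.rep_add(-delta) >= 0:
            return True
        s.rep_add(delta)
    return False


class ParallelQueue:
    def __init__(self, size):
        self.size = size
        self.cells = [None] * size
        self.turn = [0] * size             # assumed: even = empty, odd = full
        self.cond = threading.Condition()  # assumed turn-waiting mechanism
        self.i = Shared()                  # last insertion ticket (I)
        self.d = Shared()                  # last deletion ticket (D)
        self.q_lower = Shared()            # #Ql
        self.q_upper = Shared()            # #Qu

    def _wait_turn(self, cell, wanted):
        with self.cond:
            self.cond.wait_for(lambda: self.turn[cell] == wanted)

    def _pass_turn(self, cell):
        with self.cond:
            self.turn[cell] += 1
            self.cond.notify_all()

    def insert(self, data):
        """Return True on success, False on QueueOverflow."""
        if not tir(self.q_upper, 1, self.size):
            return False
        ticket = self.i.rep_add(1)
        cell = ticket % self.size
        self._wait_turn(cell, 2 * ((ticket - 1) // self.size))
        self.cells[cell] = data
        self._pass_turn(cell)
        self.q_lower.rep_add(1)
        return True

    def delete(self):
        """Return (True, data) on success, (False, None) on QueueUnderflow."""
        if not tdr(self.q_lower, 1):
            return False, None
        ticket = self.d.rep_add(1)
        cell = ticket % self.size
        self._wait_turn(cell, 2 * ((ticket - 1) // self.size) + 1)
        data = self.cells[cell]
        self._pass_turn(cell)
        self.q_upper.rep_add(-1)
        return True, data
```

A caller that receives False from insert, or (False, None) from delete, may simply retry or turn to other work, as suggested above for the QueueOverflow and QueueUnderflow conditions.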


It is important to note that when a queue is neither full
nor empty our code allows many insertions and many deletions to
proceed completely in parallel with no serial code executed.
This should be contrasted with current parallel queue algorithms,
which use small critical sections to update the insert and delete
pointers.


8.0   A  SCIENTIFIC  APPLICATION  FOR  CONCURRENT  QUEUES


We  first  became  aware  of  the  potential  importance  of  code 
permitting highly concurrent queue access during the development
of a parallel program for radiation transport. Since
computations on separate particles are independent, any number of
PEs can analyze particles asynchronously. The radiation
transport program with which we were experimenting maintains a
pool of particles; during processing, each PE deletes a particle
from the pool and then, after performing some calculations,
inserts zero or more new particles back into the pool. A queue
was used to represent the pool and, since the access routines
were very short, we initially treated the queue as a serially
reusable  resource  (i.e.  critical  sections  were  used).  However, 
simulation  (see  Gottlieb  [80])  of  our  programmed  solution  for 
this  problem  did  not  yield  the  expected  linear  speed-up: 
Addition  of  PEs  beyond  a  critical  number  (depending  upon  the 
complexity  of  the  physics  calculations)  did  not  decrease 
execution  time  by  the  expected  amount,  since  serial  queue  access 
had  become  a  significant  bottleneck.  But  subsequent  use  of  the 
highly  concurrent  queue  routines  shown  above  restored  the 
originally  expected  linear  speed-up.  Figure  7  indicates  some  of 
the  results  obtained. 


9.0   EXECUTING  A  DATAFLOW  LANGUAGE


Dataflow  hardware  and  software  have  been  considered  by  a 
number  of  authors;  see  Dennis  and  Misunas  [75]  and  Arvind  et  al. 
[78].  In  this  section  we  present  a  highly  parallel  paracomputer 
implementation  of  a  dataflow  language,  using  the  queue  management 
technique  of  the  preceding  section  to  schedule  the  execution  for 
dataflow  packets.  It  follows  from  this  that,  in  addition  to  its 
other  advantages,  our  paracomputer  can  be  regarded  as  a  plausible 
implementation  of  the  abstract  notion  of  "dataflow  machine". 
This  is  not  a  surprising  use  of  concurrent  queues  since  their 
major  intended  application  was  the  scheduling  kernel  of  a 
parallel operating system (see Gottlieb et al. [80]) and complete
dataflow  packets  awaiting  execution  are  similar  to  the  "ready 
tasks"  of  an  operating  system.  From  this  viewpoint  the  primary 
difference  between  conventional  parallel  processors  and  dataflow 
machines  lies  in  the  granularity  of  parallelism  -  moderately 
large  tasks  for  the  former  vs.  atomic  operations  for  the  latter. 
It  is,  of  course,  the  applicative  nature  of  dataflow  languages 
that  permits  low-level  parallelism  to  be  exposed. 


High  level  dataflow  languages  such  as  VAL  and  ID  are  often 
transformed  into  a  digraph-based  intermediate  form  before  actual 
machine  code  is  generated.  After  presenting  an  informal  account 
of  a  simplified  digraph-based  dataflow  language,  we  indicate  how 
its  execution  may  be  supported  by  a  paracomputer. 


The vertices of the digraph correspond to the primitive
operations of the high level dataflow language and the arcs
correspond to the flow of data from the result of one operation
to the input of another. Consider the following example from
Dennis et al. [77]. The "assignment" y <-- (a+b)*(b+c) is
transformed into

[Figure: dataflow digraph for y <-- (a+b)*(b+c)]


where  the  instruction  vertices  (often   called   dataflow   packets) 
have  the  following  format. 


Since side effects are forbidden in dataflow languages, any
operation may be performed as soon as its input data are
available; specifically the two additions in our example may be
performed concurrently.


The  idea  of  our  paracomputer  dataflow  implementation  is  to 
use  a  concurrently  accessible  queue  to  hold  completed  packets. 
To  do  this  we  first  load  the  vertices  of  the  dataflow  digraph 
into the paracomputer public memory, storing with each node the
number of operands required (i.e. its indegree). During this
loading phase, vertices found to have indegree zero are placed on
the queue. The dataflow program can now be executed by having
each PE perform the following procedure.

While True Do {
    Delete an item from the queue
    Perform the indicated operation
    For each result destination D Do {
        Place result into appropriate space in D
        If ReplaceAdd(#OperandsRequired(D),-1) = 0 Then
            Insert D on Q } }
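
To make the procedure concrete, the following Python sketch runs it for the example y <-- (a+b)*(b+c) with a single simulated PE. It is an illustration under assumptions: the packet layout (a dictionary with the hypothetical fields op, operands, required and dests), the vertex names add1, add2 and mul, and the helper names example_graph and run_dataflow are not from the paper; a collections.deque stands in for the concurrent queue; a plain decrement stands in for the replace-add; and the loop stops when the queue is empty instead of running forever.

```python
from collections import deque
import operator


def example_graph(a, b, c):
    """Packets for y <-- (a+b)*(b+c); 'required' is the operand count still missing."""
    return {
        "add1": {"op": operator.add, "operands": [a, b], "required": 0,
                 "dests": [("mul", 0)]},
        "add2": {"op": operator.add, "operands": [b, c], "required": 0,
                 "dests": [("mul", 1)]},
        "mul": {"op": operator.mul, "operands": [None, None], "required": 2,
                "dests": [("y", None)]},
    }


def run_dataflow(nodes):
    """Execute the packets and return the values sent to non-packet destinations."""
    outputs = {}
    # Loading phase: packets needing no further operands go on the queue.
    ready = deque(name for name, node in nodes.items() if node["required"] == 0)
    while ready:
        node = nodes[ready.popleft()]            # delete an item from the queue
        result = node["op"](*node["operands"])   # perform the indicated operation
        for dest, slot in node["dests"]:
            if dest not in nodes:                # a program output such as y
                outputs[dest] = result
                continue
            target = nodes[dest]
            target["operands"][slot] = result    # place result into its space in D
            target["required"] -= 1              # stands in for the replace-add of -1
            if target["required"] == 0:
                ready.append(dest)               # insert D on Q
    return outputs
```

In this sketch the two additions are both on the queue before either runs, so with several PEs and the concurrent queue of Section 7 they could be taken up simultaneously; the multiplication is enqueued only by whichever addition delivers the last missing operand.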


In  a  subsequent  report  we  plan  to  consider  this  issue  in 
more  detail  and  to  examine  the  execution  of  a  dataflow  loop. 
Since  an  actual  implementation  would  likely  support  a  coarser 
grain  of  parallelism,  one  may  need  to  consider  larger  dataflow 
packets  corresponding  to  several  atomic  operations. 


10.0   SIMULATION  AND  SCIENTIFIC  CODE  EXPERIMENTS


At  NYU,  we  have  implemented  an  instruction  level 
paracomputer  simulator  (Gottlieb  [80])  and  are  using  it  to  study 
parallel  variants  of  scientific  codes.  Applications  already 
studied  include  radiation  transport,  incompressible  fluid  flow 
within an elastic boundary, atmospheric modeling, and Monte Carlo
simulation of fluid structure.

The goals of our paracomputer simulation studies are, first,
to  develop  methodologies  for  writing  and  debugging  parallel 
codes;  second,  to  predict  the  efficiency  which  future  large 
scale  parallel  systems  can  attain.  As  an  example  of  the  approach 
taken,  and  of  the  preliminary  results  obtained,  we  report  on  some 
experiments  with  a  parallelized  variant  of  the  code  TRED2  (taken 
from Argonne's EISPACK library), which uses Householder's method
to  reduce  a  real  symmetric  matrix  to  tridiagonal  form  (see  Korn 
[81]  for  more  details). 


An  analysis  of  the  parallel  variant  of  this  code  shows  that 
the  time  required  to  reduce  an  N  by  N  matrix  using  P  processors 
is  well  approximated  by 

    T(P,N) = AN + DN**3/P + W(P,N)

where the first term represents "overhead" code that must be
executed by all PEs (e.g. loop initializations), the second term
represents work that is divided among the PEs, and W(P,N), the
waiting time, is of order max(N, P**.5). We determined the
constants experimentally by simulating TRED2 for several (P,N)
pairs  and  measuring  both  the  total  time  T  and  the  waiting  time  W. 
(Subsequent  runs  with  other  (P,N)  pairs  have  always  yielded 
results within 1% of the predicted value.) Table 2 summarizes
some  of  our  experimental  results  and  predicts  the  efficiency  for 
problems  and  machines  too  large  for  us  to  simulate  (these  values 
appear  with  an  asterisk).  In  examining  this  table,  recall  that 
the  efficiency  of  a  parallel  computation  is  defined  as 

    E(P,N) = T(1,N) / (P*T(P,N))


Although  we  consider  these  measured  efficiencies 
encouraging,  we  note  that  system  performance  can  probably  be 
improved  even  more  by  sharing  PEs  among  multiple  tasks. 
(Currently  the  simulated  PEs  perform  no  useful  work  while 
waiting.)  If  we  make  the  optimistic  assumption  that  all  the 
waiting  time  can  be  recovered,  the  efficiencies  rise  to  the 
values given in Table 3.
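
The timing model and the efficiency definition can be evaluated with a short Python sketch. Everything numerical here is an assumption: the measured constants are not reproduced in this section, so the coefficients a, d and c are hypothetical parameters, and the waiting time is modeled as exactly c*max(N, P**.5) although the text only gives its order. For the optimistic case, the sketch assumes that the waiting term is dropped from both T(1,N) and T(P,N).

```python
import math


def waiting_time(p, n, c):
    """Assumed form of W(P,N): c * max(N, sqrt(P)); only the order is given."""
    return c * max(n, math.sqrt(p))


def total_time(p, n, a, d, c):
    """T(P,N) = A*N + D*N**3/P + W(P,N) with hypothetical constants a, d, c."""
    return a * n + d * n**3 / p + waiting_time(p, n, c)


def efficiency(p, n, a, d, c, recover_waiting=False):
    """E(P,N) = T(1,N) / (P * T(P,N)).

    With recover_waiting=True the waiting term is dropped from both times,
    modeling the optimistic assumption that all waiting time is recovered.
    """
    w = 0.0 if recover_waiting else c
    return total_time(1, n, a, d, w) / (p * total_time(p, n, a, d, w))
```

For example, efficiency(16, 64, a, d, c) and efficiency(16, 64, a, d, c, recover_waiting=True) can be compared for any positive choice of the three constants; under the assumed form of W the second value is never smaller than the first.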